INTRODUCTION

The design of an automatic pattern recognition system involves
considerations in three distinct but not totally independent problem
areas. The broad selection of pattern characteristics and the design
of equipment which will measure these characteristics quickly and
efficiently are two aspects of the system which must be resolved. The
third is the formulation of a recognition scheme which will establish
the significant characteristics and operate upon them to form reliable
pattern classifications. This report is concerned primarily with the
classification problem and discusses some applications of statistical
decision theory and information theory to the design of a recognition
scheme. However, one section of the report presents the results of
an experiment, and the other problem areas were considered to the
extent that characteristics had to be defined and measured before the
data could be collected.

In general, the two theories mentioned above complement each
other. The decision theory provides the method for optimizing the
[...], and the information theory provides the method for optimizing
the rate with which information is presented to the decision rule.

The general structure of a recognition system would appear thus:

[source] -- [encoder] -- [channel] -- [decoder] -- [decision]

The source is the origin of the pattern. The encoder performs an
operation upon the pattern which makes it compatible with some
specified channel of transmission. The decoder reverses the process
of the encoder and can be considered as the element which measures
the pattern characteristics. The decision element then classifies the
pattern. Recognition is based upon the measurement of a set of
characteristics, and patterns with similar characteristics are
categorized in one class. Learning is employed by observing and
recording these characteristics for a finite number of patterns from
each class.


A classical example of a recognition system is man-to-man
communication through written symbols. A man originates the pattern
and codes it in written form, then another man decodes the pattern and
classifies it. Of course, noise usually occurs somewhere in the system
prior to decoding and complicates the decision process. Too much noise
produces an illegible pattern.
The rules of decision discussed below are generally referred to
as Bayes' rules, in which the objective is to perform the classification
in a manner which will minimize the penalty of an incorrect decision.
Two penalties or loss functions are considered. In the first case the
losses which will be suffered by each decision are established prior to
making a decision; in the second case the loss is a function of the
decision.

It  is  generally  desirable  to  develop  the  probability  distributions 
of  the  decision  rule  through  a mathematical  model  of  the  pattern.  The 
model  chosen  for  this  discussion  is  representative  of  a binary  pattern 
or  signal  with  multiplicative  noise.  The  probability  distributions  are 
developed  for  both  the  condition  of  no  learning  and  the  condition  in 
which  learning  observations  are  used  to  enhance  the  knowledge  of  the 
statistics  of  the  model.  The  case  in  which  learning  is  not  applied  to 
the  model  follows  closely  the  development  by  Braverman  (Reference  1). 

The decision rules presented here evaluate the importance of
each pattern or signal characteristic only on a comparative basis
between any two classes. In order to evaluate the importance of each
characteristic with reference to all classes, i.e., the relative
importance within the concepts of a recognition system, it is proposed
below that serious consideration be given to the expected "mutual-
information" of each characteristic as it is defined in information
theory.
I. STATISTICAL DECISIONS

1. DECISION RULE WITH COST INDEPENDENT OF DECISION

The signal or pattern space will be designated as
$\Omega = [w_1, w_2, \ldots, w_n]$ where $w_i$ represents the $i$th class of
signals or patterns. The probability distribution over $\Omega$ will be
$P = [p(w_1), p(w_2), \ldots, p(w_n)]$. Let
$X_i = [X_i(1), X_i(2), \ldots, X_i(v)]$ be the vector valued random
variable of the characteristics of the transmitted signal $X_i$, where
$X_i(c)$ is the value of the $c$th characteristic and $X_i \in w_i$. The
value associated with a random variable will be denoted by the lower
case. Let $Y = [Y(1), Y(2), \ldots, Y(v)]$ be the set for the received
signal and let the set of decisions associated with a received signal,
$Y$, be $D = [d_1, d_2, \ldots, d_n]$. The cost function $W(w_j/w_i)$ is a
measure of the loss incurred with a decision which classifies a signal
from the $i$th class as a signal from the $j$th class. The risk, $R$,
involved in making a decision from $D$ then depends upon the cost
and the probability distribution for $D$ given a received signal $y$.

\[ R[w_i, D(y)] = \sum_{j=1}^{n} W(w_j/w_i)\, p(d_j/y) \]

The expected risk depends upon the probability distribution for $y$ and $w_i$.

\[ E_R = \sum_{Y} \sum_{i=1}^{n} \sum_{j=1}^{n} W(w_j/w_i)\, p(d_j/y)\, p(y, w_i) \]

or,

\[ E_R = \sum_{Y} \sum_{i=1}^{n} \sum_{j=1}^{n} W(w_j/w_i)\, p(d_j/y)\, p(y/w_i)\, p(w_i) \]

The object of a decision rule is to minimize this expected risk. A
decision which will minimize the risk for each received signal, $y$, will
also possess a minimized risk when summed over all $y$. Thus, it is
sufficient to choose a decision rule which will minimize the following
term:

\[ \sum_{i=1}^{n} \sum_{j=1}^{n} W(w_j/w_i)\, p(d_j/y)\, p(y/w_i)\, p(w_i). \]


If it is assumed that the decision for a rejection can be ignored, i.e.,
only recognition or misrecognition are considered valid decisions, and
if in addition a misrecognition is assumed equally detrimental for all
classes, then,

\[ W(w_j/w_i) = 0 \ \text{for } i = j, \qquad W(w_j/w_i) = 1 \ \text{for } i \neq j. \]

With these values for the cost function the risk becomes:

\[ R[w_i, D(y)] = \sum_{\substack{j=1 \\ j \neq i}}^{n} p(d_j/y) = \text{probability of misrecognition.} \]

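The expected risk then is,

\[ E_R = \sum_{i=1}^{n} p(y/w_i)\, p(w_i) \Big[ \sum_{\substack{j=1 \\ j \neq i}}^{n} p(d_j/y) \Big] \]

Expanding over the subscript $i$ gives,

\[
\begin{aligned}
E_R ={}& p(y/w_1)\, p(w_1) \sum_{j=2}^{n} p(d_j/y)
 + p(y/w_2)\, p(w_2) \sum_{\substack{j=1 \\ j \neq 2}}^{n} p(d_j/y) \\
& + \cdots + p(y/w_n)\, p(w_n) \sum_{j=1}^{n-1} p(d_j/y)
\end{aligned}
\]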
In order to minimize the expected risk a decision rule must be chosen
which will eliminate the largest component in the above equation. This
rule can be formulated as follows:

DECISION: Choose class $k$ with probability 1 and all other
classes with probability 0 such that

\[ p(d_k/y) = 1 \]

and

\[ p(d_j/y) = 0 \ \text{for all } j \neq k \]

when

\[ p(w_k)\, p(y/w_k) \geq p(w_j)\, p(y/w_j) \tag{1} \]

for all $j \neq k$.

This is generally referred to as a Bayes' decision rule and is quite
often written in the form,

\[ \frac{p(y/w_k)}{p(y/w_j)} \geq \frac{p(w_j)}{p(w_k)}. \tag{2} \]
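As an illustration of the decision rule of equation (1), the following minimal Python sketch picks the class whose product of prior and likelihood is largest. The function name `bayes_decide` and the numerical values are hypothetical and are not taken from the report; the likelihoods are assumed to be already available for one received signal.

```python
def bayes_decide(priors, likelihoods):
    """Return the index k that maximizes p(w_k) * p(y/w_k).

    priors[k]      -- assumed class probability p(w_k)
    likelihoods[k] -- assumed value of p(y/w_k) for the received y
    """
    scores = [p * l for p, l in zip(priors, likelihoods)]
    # Ties are resolved in favor of the lowest index (an arbitrary choice).
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative numbers only.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.30]
decision = bayes_decide(priors, likelihoods)
```

Here the products are 0.05, 0.12 and 0.06, so the second class (index 1) is chosen even though the first class has the largest prior.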

2. DECISION RULE WITH COST DEPENDENT UPON DECISION

Another expression for the cost function is,

\[ -\log p(w_i/d_j). \]

The expected risk is then,

\[ E_R = -\sum_{Y} \sum_{i=1}^{n} \sum_{j=1}^{n} \log p(w_i/d_j)\, p(d_j/y)\, p(y/w_i)\, p(w_i). \]

However, the following probability expression can be rearranged as

\[
p(d_j/y)\, p(y/w_i)\, p(w_i) = p(d_j/y)\, p(w_i, y) = p(d_j/y)\, p(w_i/y)\, p(y)
= p(d_j, w_i/y)\, p(y) = p(d_j, w_i, y)
\]

Therefore,

\[ \sum_{Y} p(d_j/y)\, p(y/w_i)\, p(w_i) = \sum_{Y} p(d_j, w_i, y) = p(d_j, w_i) = p(d_j/w_i)\, p(w_i) \]

Thus the expected risk can be rewritten as:

\[ E_R = -\sum_{i=1}^{n} \sum_{j=1}^{n} p(w_i)\, p(d_j/w_i) \log p(w_i/d_j) \]

This is precisely the definition of equivocation entropy in "information
theory." With this cost function, a minimization of the expected risk
involves a minimization of "equivocation." In order to compute the
conditions for an extremum, when any pair of signals is being
considered, divide the observation space into two fields, $X_1$ and $X_2$,
then choose $d_1$ if $y \in X_1$ and $d_2$ if $y \in X_2$. If $f(y/w_i)$ is the
conditional frequency function for the signal, $y$, then the expected
risk will have the form

\[ E_R = -\sum_{i=1}^{2} \sum_{j=1}^{2} \int_{Y} f(y/w_i)\, p(d_j/y)\, p(w_i) \log p(w_i/d_j)\, dy \tag{3} \]

The object is to find the conditions associated with the positioning of
the boundary between the two fields which will result in an extremum
in the expected risk.


Assume that this extremum exists at some signal value $y'$ and
that $X_1$ includes all signals between $y_0$ and $y'$ while $X_2$ includes all
signals between $y'$ and $y_{00}$. An example of this would be the
transmission of a voltage with values between $y_0$ and $y_{00}$ which is either
signal or noise, with the signal represented by a skewed frequency
function biased towards the high voltages and the noise represented by
a skewed frequency function biased towards the low values. The value
for $y'$ would be the voltage which divides the range into two parts such
that $E_R$ is an extremum. The log term in equation (3) can be rewritten
as,

\[ \log p(w_i/d_j) = \log \frac{\int_{Y} f(y/w_i)\, p(d_j/y)\, p(w_i)\, dy}{\int_{Y} p(d_j/y)\, f(y)\, dy} \]
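Then,

\[
\begin{aligned}
-E_R ={}& \int_{y_0}^{y'} f(y/w_1)p(w_1)\,dy \left[ \log \int_{y_0}^{y'} f(y/w_1)p(w_1)\,dy - \log \int_{y_0}^{y'} f(y)\,dy \right] \\
& + \int_{y_0}^{y'} f(y/w_2)p(w_2)\,dy \left[ \log \int_{y_0}^{y'} f(y/w_2)p(w_2)\,dy - \log \int_{y_0}^{y'} f(y)\,dy \right] \\
& + \int_{y'}^{y_{00}} f(y/w_1)p(w_1)\,dy \left[ \log \int_{y'}^{y_{00}} f(y/w_1)p(w_1)\,dy - \log \int_{y'}^{y_{00}} f(y)\,dy \right] \\
& + \int_{y'}^{y_{00}} f(y/w_2)p(w_2)\,dy \left[ \log \int_{y'}^{y_{00}} f(y/w_2)p(w_2)\,dy - \log \int_{y'}^{y_{00}} f(y)\,dy \right]
\end{aligned}
\]

At the extremum,

\[ \frac{\partial E_R}{\partial y'} = 0 \tag{4} \]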


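Then, writing $p(w_i, d_1) = \int_{y_0}^{y'} f(y/w_i)p(w_i)\,dy$, $p(d_1) = \int_{y_0}^{y'} f(y)\,dy$, $p(w_i, d_2) = \int_{y'}^{y_{00}} f(y/w_i)p(w_i)\,dy$ and $p(d_2) = \int_{y'}^{y_{00}} f(y)\,dy$,

\[
\begin{aligned}
-\frac{\partial E_R}{\partial y'} ={}& f(y'/w_1)p(w_1)\left[\log p(w_1,d_1) - \log p(d_1)\right]
 + p(w_1,d_1)\left[\frac{f(y'/w_1)p(w_1)}{p(w_1,d_1)} - \frac{f(y')}{p(d_1)}\right] \\
& + f(y'/w_2)p(w_2)\left[\log p(w_2,d_1) - \log p(d_1)\right]
 + p(w_2,d_1)\left[\frac{f(y'/w_2)p(w_2)}{p(w_2,d_1)} - \frac{f(y')}{p(d_1)}\right] \\
& - f(y'/w_1)p(w_1)\left[\log p(w_1,d_2) - \log p(d_2)\right]
 - p(w_1,d_2)\left[\frac{f(y'/w_1)p(w_1)}{p(w_1,d_2)} - \frac{f(y')}{p(d_2)}\right] \\
& - f(y'/w_2)p(w_2)\left[\log p(w_2,d_2) - \log p(d_2)\right]
 - p(w_2,d_2)\left[\frac{f(y'/w_2)p(w_2)}{p(w_2,d_2)} - \frac{f(y')}{p(d_2)}\right]
\end{aligned}
\]

Collecting terms,

\[
\begin{aligned}
-\frac{\partial E_R}{\partial y'} ={}& f(y'/w_1)p(w_1) \log\frac{p(w_1/d_1)}{p(w_1/d_2)} - f(y'/w_2)p(w_2) \log\frac{p(w_2/d_2)}{p(w_2/d_1)} \\
& + f(y'/w_1)p(w_1) - f(y')\frac{p(w_1,d_1)}{p(d_1)} + f(y'/w_2)p(w_2) - f(y')\frac{p(w_2,d_1)}{p(d_1)} \\
& - f(y'/w_1)p(w_1) + f(y')\frac{p(w_1,d_2)}{p(d_2)} - f(y'/w_2)p(w_2) + f(y')\frac{p(w_2,d_2)}{p(d_2)}
\end{aligned}
\]

From equation (4),

\[
\begin{aligned}
0 ={}& f(y'/w_1)p(w_1) \log\frac{p(w_1/d_1)}{p(w_1/d_2)} - f(y'/w_2)p(w_2) \log\frac{p(w_2/d_2)}{p(w_2/d_1)} \\
& - f(y')\,\frac{p(w_1,d_1) + p(w_2,d_1)}{p(d_1)} + f(y')\,\frac{p(w_1,d_2) + p(w_2,d_2)}{p(d_2)}
\end{aligned}
\]

Since $p(w_1,d_j) + p(w_2,d_j) = p(d_j)$, the last two terms cancel. Thus,

\[ \frac{f(y'/w_1)\,p(w_1)}{f(y'/w_2)\,p(w_2)} = \frac{\log \dfrac{p(w_2/d_2)}{p(w_2/d_1)}}{\log \dfrac{p(w_1/d_1)}{p(w_1/d_2)}}. \]

This equation is the basis of a likelihood-ratio test similar to the one
described in equation (2) but with a more flexible cost function.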
II. STATISTICAL MODEL

1. MODEL WITHOUT LEARNING

For this discussion the signal or pattern considered will have a
binary structure. The signal would appear as a series of pulses with
amplitude of plus or minus one (±1), and the pattern would appear as a
cellular matrix with the cells interior to the pattern having a value of
+1 and those in the background field having a value of -1. If the set of
perfect signals associated with the signal space $\Omega$ is
$S = [S_1, S_2, \ldots, S_n]$ then the set of transmitted signals and the
received signal can be considered to be a perturbation of the perfect
signal by some noise, $N$. Thus, the expression for the signals will be,

\[ X_i = S_i N_i = [S_i(1)N_i(1),\ S_i(2)N_i(2),\ \ldots,\ S_i(v)N_i(v)] \]

\[ Y = SN = [S(1)N(1),\ S(2)N(2),\ \ldots,\ S(v)N(v)] \]

where $N_i(c)$ takes on the values of +1 or -1 with the following
probability distribution.

\[ P[N_i(c) = +1] = p_i(c) \]

\[ P[N_i(c) = -1] = 1 - p_i(c) = q_i(c) \]

If the noise components, $N_i(c)$, are independent for all $c$ then the
probability distribution for reception of $y$ given that a member of $w_i$
was transmitted is the product of the probability distributions for the
noise.

\[ p(y/w_i) = \prod_{c=1}^{v} [p_i(c)]^{\frac{1+n_i(c)}{2}} [q_i(c)]^{\frac{1-n_i(c)}{2}} \]

However,

\[ y(c) = s_i(c) \cdot n_i(c) \]

when $w_i$ is transmitted. Thus,

\[ y(c) \cdot s_i(c) = n_i(c) \]

and

\[ p(y/w_i) = \prod_{c=1}^{v} [p_i(c)]^{\frac{1+y(c)s_i(c)}{2}} [q_i(c)]^{\frac{1-y(c)s_i(c)}{2}} \]

If the noise components, $N_i(c)$, are also identically distributed for all
$c$ then,

\[ p(y/w_i) = [p_i]^{\frac{1}{2}\left(v + \sum_{c=1}^{v} y(c)s_i(c)\right)} [q_i]^{\frac{1}{2}\left(v - \sum_{c=1}^{v} y(c)s_i(c)\right)} \]


If the decision rule of equation (1) is applied here, then,

Choose $d_i$ when:

\[
p(w_i)\,[p_i]^{\frac{1}{2}\left(v + \sum_{c=1}^{v} y(c)s_i(c)\right)} [q_i]^{\frac{1}{2}\left(v - \sum_{c=1}^{v} y(c)s_i(c)\right)}
\geq
p(w_j)\,[p_j]^{\frac{1}{2}\left(v + \sum_{c=1}^{v} y(c)s_j(c)\right)} [q_j]^{\frac{1}{2}\left(v - \sum_{c=1}^{v} y(c)s_j(c)\right)}
\]

If an additional assumption is made that,

\[ p_i = p_j = p, \qquad q_i = q_j = q, \]

the condition for the decision becomes,

\[
[p]^{\frac{1}{2}\left(\sum_{c=1}^{v} y(c)s_i(c) - \sum_{c=1}^{v} y(c)s_j(c)\right)}
[q]^{-\frac{1}{2}\left(\sum_{c=1}^{v} y(c)s_i(c) - \sum_{c=1}^{v} y(c)s_j(c)\right)}
\geq \frac{p(w_j)}{p(w_i)}
\]

Taking the logarithm gives,

\[
\frac{1}{2}\left[\sum_{c=1}^{v} y(c)s_i(c) - \sum_{c=1}^{v} y(c)s_j(c)\right] \log\frac{p}{q} \geq \log\frac{p(w_j)}{p(w_i)}
\]

If all classes have equal probability and $p > 1/2$, the condition for a
decision is reduced further to,

\[ \sum_{c=1}^{v} y(c)s_i(c) \geq \sum_{c=1}^{v} y(c)s_j(c) \]

Thus,  y would  be  classified  with  the  signal  which  contains  the  largest 
number  of  pulses  similar  to  y or  with  the  pattern  which  contains  the 
largest  number  of  cells  similar  to  y. 
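The reduced rule above amounts to choosing the class whose perfect signal has the largest correlation with the received signal. A minimal Python sketch follows, assuming equal class probabilities and p > 1/2 as stated in the text; the names `correlate` and `classify` and the four-cell templates are hypothetical and serve only as an illustration.

```python
def correlate(y, s):
    """Sum over c of y(c) * s(c) for two sequences of +1 / -1 values."""
    return sum(yc * sc for yc, sc in zip(y, s))


def classify(y, templates):
    """Return the index of the template with the largest correlation to y."""
    scores = [correlate(y, s) for s in templates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical perfect signals for three classes, four cells each.
templates = [
    [+1, +1, -1, -1],
    [+1, -1, +1, -1],
    [-1, -1, +1, +1],
]
# Received signal: the first template with its last cell inverted by noise.
y = [+1, +1, -1, +1]
chosen = classify(y, templates)
```

Because every cell is +1 or -1, the correlation equals the number of matching cells minus the number of mismatching cells, so maximizing it is the same as maximizing the number of similar cells.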

2. MODEL WITH LEARNING

In many cases of interest the perfect signal is unknown and the
noise probability is also unknown. It is desirable then to gain some
knowledge about these by incorporating a learning process prior to
making a decision. Let the signal
$X_{ik} = [X_{ik}(1), X_{ik}(2), \ldots, X_{ik}(v)]$ be the $k$th signal from the
class $w_i$ used in the learning process and let
$Z_i = [X_{i1}, X_{i2}, \ldots, X_{ik_i}]$ be the set of $k_i$ signals from the
$i$th class applied to the learning process. If $Z$ is the total set of
signals for the learning process,

\[ Z = [Z_1, Z_2, \ldots, Z_n] \]

then the expected risk can be expressed as:

\[ E_R = \sum_{Y,Z} \sum_{i=1}^{n} \sum_{j=1}^{n} W(w_j/w_i)\, p(d_j/y)\, p(y/z, w_i)\, p(w_i/z)\, p(z) \]

If the signal classes are independent of the learning observations then

\[ p(w_i/z) = p(w_i) \]

Following the reasoning used previously in the model without learning,
the decision rule is,

DECISION RULE: Choose class $k$ with probability 1 and
all other classes with probability 0 such that

\[ p(d_k/y) = 1 \]

and

\[ p(d_j/y) = 0 \ \text{for } j \neq k \]

when

\[ p(w_k)\, p(y/z_k, w_k) \geq p(w_j)\, p(y/z_j, w_j) \ \text{for all } j \neq k \tag{5} \]

In order to evaluate the conditions for this rule the probability
distribution $p(y/z_i, w_i)$ must be computed. This distribution can be
represented by the ratio,

\[ p(y/z_i, w_i) = \frac{p(y, z_i/w_i)}{p(z_i/w_i)} \]
If the patterns of each class are independent then

\[ p(y, z_i/w_i) = p(y/w_i)\, p(x_{i1}/w_i)\, p(x_{i2}/w_i) \cdots p(x_{ik_i}/w_i) \]

However, from the previous development,

\[ p(y/w_i) = \prod_{c=1}^{v} [p_i(c)]^{\frac{1+y(c)s_i(c)}{2}} [q_i(c)]^{\frac{1-y(c)s_i(c)}{2}} \]

and,

\[ p(x_{ik}/w_i) = \prod_{c=1}^{v} [p_i(c)]^{\frac{1+x_{ik}(c)s_i(c)}{2}} [q_i(c)]^{\frac{1-x_{ik}(c)s_i(c)}{2}} \]

Then,

\[
p(y, z_i/w_i) = \prod_{c=1}^{v}
[p_i(c)]^{\frac{1}{2}\left\{(k_i+1) + s_i(c)\left[y(c) + \sum_{k=1}^{k_i} x_{ik}(c)\right]\right\}}
[q_i(c)]^{\frac{1}{2}\left\{(k_i+1) - s_i(c)\left[y(c) + \sum_{k=1}^{k_i} x_{ik}(c)\right]\right\}}
\]

Since the $X_{ik}(c)$'s and $y(c)$ are random variables corresponding to
$k_i+1$ observations and $p_i(c)$ is a parameter to be estimated, the
probability distribution $p[y(c), z_i(c)/w_i]$ has the form of the likelihood
function $L[y(c), X_{i1}(c), X_{i2}(c), \ldots, X_{ik_i}(c)]$. It is desired to find the
maximum likelihood estimator, $\hat{p}_i(c)$, for $p_i(c)$. This can be obtained
by differentiating with respect to $p_i(c)$ and setting the derivative equal
to zero. With $u_i(c) = \sum_{k=1}^{k_i} x_{ik}(c)$,

\[ 0 = \frac{(k_i+1)\left[1 - 2p_i(c)\right] + s_i(c)\left[u_i(c) + y(c)\right]}{2\,p_i(c)\left[1 - p_i(c)\right]} \]

Thus the estimator is,

\[ \hat{p}_i(c) = \frac{(k_i+1) + s_i(c)\left[u_i(c) + y(c)\right]}{2(k_i+1)} \]

The estimator for the probability distribution $p(z_i/w_i)$, written here as
$\hat{p}'_i(c)$, is

\[ \hat{p}'_i(c) = \frac{k_i + s_i(c)u_i(c)}{2k_i} \]

For large $k_i$,

\[
\frac{\hat{p}_i(c)}{\hat{p}'_i(c)}
= \frac{\dfrac{(k_i+1) + s_i(c)\left[u_i(c) + y(c)\right]}{2(k_i+1)}}{\dfrac{k_i + s_i(c)u_i(c)}{2k_i}}
= \frac{\dfrac{k_i + s_i(c)u_i(c)}{k_i+1} + \dfrac{1 + s_i(c)y(c)}{k_i+1}}{\dfrac{k_i + s_i(c)u_i(c)}{k_i}}
\approx 1.0
\]


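Then, the probability distribution $p(y/z_i, w_i)$ becomes

\[
p(y/z_i, w_i) = \prod_{c=1}^{v}
\frac{[\hat{p}_i(c)]^{\frac{1}{2}\left\{(k_i+1) + s_i(c)[y(c)+u_i(c)]\right\}}\,
      [\hat{q}_i(c)]^{\frac{1}{2}\left\{(k_i+1) - s_i(c)[y(c)+u_i(c)]\right\}}}
     {[\hat{p}_i(c)]^{\frac{1}{2}\left\{k_i + s_i(c)u_i(c)\right\}}\,
      [\hat{q}_i(c)]^{\frac{1}{2}\left\{k_i - s_i(c)u_i(c)\right\}}}
\]

or,

\[ p(y/z_i, w_i) = \prod_{c=1}^{v} [\hat{p}_i(c)]^{\frac{1+s_i(c)y(c)}{2}} [\hat{q}_i(c)]^{\frac{1-s_i(c)y(c)}{2}} \]

If it is assumed that the learning process is biased toward the perfect
signal such that $p_i(c) > 1/2$,

\[ s_i(c) = \operatorname{sign} u_i(c) = \operatorname{sign} \sum_{k=1}^{k_i} x_{ik}(c). \]

With this assumption the probability distribution becomes,

\[
p(y/z_i, w_i) = \prod_{c=1}^{v}
[\hat{p}_i(c)]^{\frac{1}{2}\left\{1 + y(c)\,\operatorname{sign}\sum_{k=1}^{k_i} x_{ik}(c)\right\}}
[\hat{q}_i(c)]^{\frac{1}{2}\left\{1 - y(c)\,\operatorname{sign}\sum_{k=1}^{k_i} x_{ik}(c)\right\}}
\tag{6}
\]

where

\[
\hat{p}_i(c) = \frac{(k_i+1) + \operatorname{sign}\left[\sum_{k=1}^{k_i} x_{ik}(c)\right]\left[y(c) + \sum_{k=1}^{k_i} x_{ik}(c)\right]}{2(k_i+1)}
\tag{7}
\]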
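Equations (6) and (7) can be evaluated directly as a sum of logarithms, which is the quantity compared between classes in the decision rule of equation (5). The Python sketch below is a minimal illustration under stated assumptions: the function names and sample data are hypothetical, natural logarithms are used, and a cell whose learning sum is exactly zero is arbitrarily given sign +1, a case the report does not address.

```python
import math


def sign(value):
    # Assumption: a zero learning sum is treated as +1.
    return 1 if value >= 0 else -1


def log_score(y, samples, prior):
    """log p(w_i) + log p(y/z_i, w_i) following equations (6) and (7).

    y       -- received pattern, a list of +1 / -1 cell values
    samples -- learning patterns of one class, each a list of +1 / -1
    prior   -- assumed class probability p(w_i)
    """
    k = len(samples)
    total = math.log(prior)
    for c, yc in enumerate(y):
        u = sum(x[c] for x in samples)          # learning sum u_i(c)
        s = sign(u)                             # estimated perfect signal
        p_hat = ((k + 1) + s * (yc + u)) / (2 * (k + 1))   # equation (7)
        # Exponents in (6) select p_hat when y(c) agrees with s, else q_hat.
        total += math.log(p_hat if yc * s > 0 else 1.0 - p_hat)
    return total


# Hypothetical learning sets: three samples per class, three cells each.
class_a = [[1, 1, -1], [1, 1, -1], [1, -1, -1]]
class_b = [[-1, 1, 1], [-1, 1, 1], [-1, -1, 1]]
y = [1, 1, -1]
score_a = log_score(y, class_a, 0.5)
score_b = log_score(y, class_b, 0.5)
```

Because y(c) is included in the estimate of equation (7), the selected factor is never zero, so the logarithm is always defined. In this example class A receives the larger score and would be chosen.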


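It would facilitate the mechanization of the decision if the rule
expressed in equation (1) could be put into the linear form,

\[ G_k + \sum_{c=1}^{v} F_k(c)\,y(c) \geq G_j + \sum_{c=1}^{v} F_j(c)\,y(c) \]

This can be accomplished by letting

\[ G_k + \sum_{c=1}^{v} F_k(c)\,y(c) = \log\left[ p(w_k)\, p(y/z_k, w_k) \right]. \]

Then,

\[ G_k + \sum_{c=1}^{v} F_k(c)\,y(c) = \log p(w_k) + \log p(y/z_k, w_k) \]

or,

\[
G_k + \sum_{c=1}^{v} F_k(c)\,y(c) = \log p(w_k) + \sum_{c=1}^{v} \left[ \frac{1 + y(c)\operatorname{sign} u_k(c)}{2} \log p_k(c) + \frac{1 - y(c)\operatorname{sign} u_k(c)}{2} \log q_k(c) \right]
\]

\[
\begin{aligned}
G_k + \sum_{c=1}^{v} F_k(c)\,y(c) ={}& \log p(w_k) + \sum_{c=1}^{v} \frac{1}{2}\left[\log p_k(c) + \log q_k(c)\right] \\
& + \sum_{c=1}^{v} \frac{1}{2}\, y(c) \operatorname{sign} u_k(c) \left[\log p_k(c) - \log q_k(c)\right]
\end{aligned}
\tag{8}
\]

Thus,

\[ G_k = \log p(w_k) + \sum_{c=1}^{v} \frac{1}{2}\left[\log p_k(c) + \log q_k(c)\right] \]

and

\[ F_k(c) = \frac{1}{2} \operatorname{sign} u_k(c) \left[\log p_k(c) - \log q_k(c)\right] \]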

3. INFORMATION THEORY

If the value for $y(c)$ in equation (8) has the same sign as $u_k(c)$
then,

\[ G_k + F_k(c)\,y(c) = \log p(w_k) + \log p_k(c); \]

if $y(c)$ possesses the opposite sign,

\[ G_k + F_k(c)\,y(c) = \log p(w_k) + \log q_k(c). \]

In "information theory" the terms

\[ I(w_k) = -\log p(w_k) \]

and

\[ I(c/w_k) = -\log p_k(c) \]

or

\[ I(c/w_k) = -\log q_k(c) \]

are referred to as the "self-information." Using the definitions of
"information theory," the decision rule would be a summation of the
self-information associated with a particular class plus the
self-information associated with the learning process for each component
or characteristic of that class. If the probabilities of the classes are
equal, i.e., $p(w_k) = p(w_j)$, the decision rule is based upon a
comparison of the self-information of a particular component or
characteristic of one class against the self-information of this same
characteristic in another class. The average, or expected amount of
self-information for a specified component of one class, is expressed
in equation (9) where the following nomenclature change has been made
to be consistent with other reports on information theory:
$p_k(c) = p(c/w_k)$ and $q_k(c) = q(c/w_k)$.

\[ E[I(c/w_k)] = -\left[ p(c/w_k) \log p(c/w_k) + q(c/w_k) \log q(c/w_k) \right]. \tag{9} \]

The average self-information for one characteristic and for all
classes then is,

\[ E[I(c/\Omega)] = -\sum_{k=1}^{n} p(w_k) \left[ p(c/w_k) \log p(c/w_k) + q(c/w_k) \log q(c/w_k) \right] \tag{10} \]

A question arises when an attempt is made to
evaluate the information obtained about a particular class of signals by
comparing the self-information of a specified component for two
classes. That is, which components provide significant information
about the transmitted signal class and which ones could be ignored or
not even transmitted without loss of information about the signal
class? A measure of the importance is contained in the statistical
dependence between the characteristics and the signal. Information
theory suggests the following function for measuring this quality.

\[ I(c, w_k) = \log \frac{p(c, w_k)}{p(c)\,p(w_k)} \]

or,

\[ I(c, w_k) = \log \frac{q(c, w_k)}{q(c)\,p(w_k)} \]

This term is referred to as "mutual-information" and the expected or
average mutual information is,

\[ E[I(c, \Omega)] = \sum_{k=1}^{n} \left[ p(c, w_k) \log \frac{p(c, w_k)}{p(c)\,p(w_k)} + q(c, w_k) \log \frac{q(c, w_k)}{q(c)\,p(w_k)} \right] \]
Expanding, this gives,

\[
\begin{aligned}
E[I(c, \Omega)] ={}& \sum_{k=1}^{n} \left[ p(w_k)\,p(c/w_k) \log p(c/w_k) + p(w_k)\,q(c/w_k) \log q(c/w_k) \right] \\
& - \sum_{k=1}^{n} \left[ p(c, w_k) \log p(c) + q(c, w_k) \log q(c) \right] \\
={}& -\left[ p(c) \log p(c) + q(c) \log q(c) \right] \\
& + \sum_{k=1}^{n} p(w_k) \left[ p(c/w_k) \log p(c/w_k) + q(c/w_k) \log q(c/w_k) \right]
\end{aligned}
\tag{11}
\]

Thus the expected mutual-information, referred to as "transinformation,"
is the expected self-information in the component, unconditioned by
the signals, minus the expected self-information of equation (10).

These terms are referred to as entropy and designated by $H$. Using
the nomenclature of information theory, equation (11) becomes,

\[ I(C, \Omega) = H(C) - H(C/\Omega) \]
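The transinformation of a single binary characteristic can be computed from the class probabilities p(w_k) and the conditional probabilities p(c/w_k) as H(C) - H(C/Ω). The following Python sketch is an illustration only: the function names are hypothetical, and base-2 logarithms are an assumption, since the report does not state the base it uses.

```python
import math


def binary_entropy(p):
    """-(p log2 p + q log2 q) with q = 1 - p; 0 * log 0 is taken as 0."""
    return -sum(t * math.log2(t) for t in (p, 1.0 - p) if t > 0.0)


def transinformation(priors, p_given_class):
    """H(C) - H(C/Omega) for one binary characteristic c.

    priors[k]        -- assumed class probability p(w_k)
    p_given_class[k] -- p(c/w_k), probability that the cell is +1 in class k
    """
    # Unconditional probability p(c) = sum over k of p(w_k) p(c/w_k).
    p_c = sum(pw * pc for pw, pc in zip(priors, p_given_class))
    h_c = binary_entropy(p_c)
    h_c_given = sum(pw * binary_entropy(pc)
                    for pw, pc in zip(priors, p_given_class))
    return h_c - h_c_given


# A cell that separates two equally likely classes perfectly carries 1 bit.
perfect = transinformation([0.5, 0.5], [1.0, 0.0])
# A cell that behaves identically in both classes carries no information.
useless = transinformation([0.5, 0.5], [0.7, 0.7])
```

A characteristic with transinformation near zero behaves the same way in every class and is a candidate for removal, which is the idea examined in the experiment that follows.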

4. AN EXPERIMENT WITH WRITTEN NUMERALS

In order to explore the applications of the theory developed above,
a simple experiment was performed with seven samples of the ten
numbers 0 through 9. The experimental data were collected by having
seven individuals print the numbers 0 through 9 inside a 1/4-inch
square (see Figure 1). The formation of these numbers was controlled
to the extent that an original pattern set was provided as an example
and only straight lines were permitted. The numbers were then
magnified and quantized into a 16 x 16 cellular matrix as shown for the
number 4 in Figure 2. The first six samples were chosen as the
learning set and the seventh sample was used to check the
effectiveness of the recognition scheme.

Three sets of data were tabulated from the learning patterns for
each pattern class. The first set of data was $\sum_{k=1}^{6} X_{ik}$ for each
cell of the 16 by 16 matrix. An example of the resulting values for the
number 4 is shown in Figure 3. The other two sets of data involved the
generation of a transformed pattern from the original by observing the
number of pattern crossings a scanning system would encounter in the
vertical and horizontal directions if the scanning resolution was
comparable to the quantizing of the pattern into the 16 x 16 matrix.

The transformed patterns for the vertical scan are embodied in a
4 x 16 cellular matrix in which the 16 columns represent the 16 scan
lines and the 4 rows represent the number of crossings encountered.
An example of this matrix for the number 4 is shown in Figure 4. The
transformed patterns obtained by horizontal crossing are contained in a
2 x 16 matrix and an example of this for the number 4 is shown in
Figure 5. Each cell of the 4 x 16 matrix and the 2 x 16 matrix was
then treated in the same manner as the cells of the original pattern
matrix in order to establish the statistic $\sum_{k=1}^{k_i} X_{ik}$. Examples of
these are shown in Figures 6 and 7.
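The crossing transformation can be illustrated with a short Python sketch. The report does not define a crossing precisely, so the sketch assumes that a crossing is counted each time a scan line passes from a background cell (-1) into a pattern cell (+1), and that the vertical scan lines are the columns of the matrix; the function names and the small 4 x 4 pattern are hypothetical.

```python
def count_crossings(line):
    """Count entries into the pattern (-1 to +1 transitions) along one line.

    Assumption: the scan starts in the background, so a line that begins
    on a pattern cell counts that cell as the first crossing.
    """
    crossings = 0
    previous = -1
    for cell in line:
        if cell == 1 and previous == -1:
            crossings += 1
        previous = cell
    return crossings


def vertical_crossings(pattern):
    """Crossings for each column, taken here as the vertical scan lines."""
    return [count_crossings(column) for column in zip(*pattern)]


def horizontal_crossings(pattern):
    """Crossings for each row, taken here as the horizontal scan lines."""
    return [count_crossings(row) for row in pattern]


# Hypothetical 4 x 4 pattern (+1 = pattern cell, -1 = background cell).
pattern = [
    [-1, 1, -1, 1],
    [-1, 1, -1, -1],
    [-1, -1, -1, 1],
    [-1, 1, -1, 1],
]
vertical = vertical_crossings(pattern)
horizontal = horizontal_crossings(pattern)
```

Each count could then be encoded as a column of a binary matrix with one row per possible crossing count, which is one reading of the 4 x 16 and 2 x 16 matrices described above.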


The decision rule of equation (5) and the statistical model
of equations (6) and (7) were applied to the statistics of the
transformed patterns as a recognition scheme for the seventh sample
of patterns. The form of the decision rule which was used here was
the summation of the logarithm of the probabilities. The results for
the horizontal crossings are shown in Table 1. The encircled values
denote the pattern classes where confusion has occurred because the
values are larger than those for the proper pattern classes which occur
along the diagonal. Table 2 represents the same recognition scheme
for the vertical crossings. The misrecognition present in these two
transformed patterns can be eliminated by combining the two as a
single transformed pattern in a 6 by 16 matrix. The results of the
recognition scheme for the combined transformed patterns are shown in
Table 3.

A simplification of the decision rule and the statistical model was
applied to the statistics of the 16 x 16 matrix of the original patterns
for recognition of the seventh sample of patterns. This simplification
involved the summing of the probabilities for the cells instead of the
logarithm of the probabilities, where the probabilities were computed
from the following equation,

\[
\hat{p}_i(c) = \frac{k_i + \operatorname{sign}\left[\sum_{k=1}^{k_i} X_{ik}(c)\right] \sum_{k=1}^{k_i} X_{ik}(c)}{2k_i}
\]
Figure 3. A Compilation of the Values for $\sum_{k=1}^{6} X_{ik}$ for the Hand
Printed 4s Occurring in the First Six Columns in Figure 1.
(Specifically, These are Values for $\sum_{k=1}^{6} X_{4k}$.)

Figure 4. Transformed Pattern for the Vertical Scan Lines
of the Number 4 Shown in Figure 2. (Axes: scan lines 1 through 16;
number of vertical crossings.)

Figure 5. Transformed Pattern for the Horizontal Scan
Lines of the Number 4 Shown in Figure 2.

Figure 6. $\sum_k X_{ik}$ for the Vertical Scan Patterns of the 4s.

Figure 7. $\sum_k X_{ik}$ for the Horizontal Scan Patterns of the 4s.

Table 1. Recognition Matrix for the Seventh Set of Patterns Formed
by Summing the Log of the Probabilities for the 64 Cells of the
Horizontal Crossing Transformation.

Table 2. Recognition Matrix for the Seventh Set of Patterns Formed
by Summing the Log of the Probabilities for the 32 Cells of the
Vertical Crossing Transformation.

Table 3. Recognition Matrix for the Combined Res[...]
The results of this recognition scheme are shown in Table 4. Once
again the enclosed values represent pattern classes where confusion
has occurred. Only three of the ten numerals are correctly classified
for this particular scheme.

It would seem, from the discussion of information theory above,
that a large number of cells could be eliminated from the 16 x 16
matrix on the basis of transinformation content without incurring a
proportionate deterioration in recognition capabilities. In order to
evaluate this scheme, the transinformation for each of the 256 cells
was computed. These values are shown in Figure 8. A transinformation
value of 0.30 was arbitrarily chosen as a lower limit and all
cells with values equal to or greater than this were used with the
simplified recognition scheme described above to recompute the
recognition matrix. The distribution of the 55 cells (approximately
1/5 of the total number) with values equal to or greater than 0.30 is
shown in Figure 9. The results of the recognition scheme are shown
in Table 5. A comparison between these results and the values
obtained by using all of the cells, as presented in Table 4, shows a
reduction in the frequency of confusions from [...] to [...] and an
increase in the number of correct recognitions from 3 to 5. As a
contrast, the 55 cells with the lowest transinformation content (see
Figure 10) were also used with this recognition scheme. These results,
presented in Table 6, indicate how insignificant their contribution is
with regard to recognition of the pattern classes.



EM  1162-141 


[Figure 8: 16 x 16 array of transinformation values, one value per cell.]

Figure 8. Transinformation Values as Computed From
Equation 10 for Each of the Cells in the
16 x 16 Matrix.

Table 6. Recognition Matrix for the Seventh Set of Patterns Formed
by Summing the Probability for the Blackened Cells

III. CONCLUSIONS

The  inadequacies  of  the  experiment  preclude  any  quantitative 
conclusions.  However,  the  results  do  indicate  two  areas  in  which 
further  investigations  could  be  rewarding.  One  is  the  design  of  a 
recognition scheme which would automatically read controlled
hand-printed characters. The other is the development of a procedure for
evaluating  and  weighting  pattern  characteristics  with  respect  to  their 
discriminating  qualities. 

A certain degree of control is required in the printing of characters
even when the recognition is to be performed by humans. It is
reasonable to expect that the complexity of a machine to recognize
hand-printed characters depends to a large extent upon how much
responsibility can be put upon the printer for controlling the distortion
of the characters. In the case of the computer programmer, a number
of restrictions have already been applied with regard to where and
how the program is to be written and the legibility required. If
additional controls (which require the programmer to imitate a particular
set of [illegible] a box or a printed [illegible]) are acceptable, it
appears that a simple pattern recognition scheme could be devised to
automatically read and punch computer programs. Of course, the
problem is two-sided in that the amount of distortion acceptable to the
recognition scheme must not be too restrictive with respect to human
capabilities. To answer the questions arising from these problems,
further studies would be required to establish machine complexity
versus distortion of characters. Further studies would also be required
to define the capabilities of humans to print controlled characters,
including considerations for the time element and the increased
distortion caused by fatigue.

The area of evaluation of pattern characteristics has general
applications. It has been shown that transinformation has a definite
bearing on the discriminating qualities of a characteristic. However,
the linearity of this relationship and the effects of statistical
dependence between characteristics have not been considered. In order
to establish some quantitative measures in this area, it would be
necessary to perform a more comprehensive experiment on some particular
set of pattern classes. The initial problems confronting such an
experiment are the compiling of the sampled patterns, data processing
for computer application, and computer programming to tabulate the
statistics. It would also be of considerable interest to compare the
evaluations of characteristics obtained with information theory
techniques and with the adaptive memory process.
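
As a rough illustration of why statistical dependence between characteristics matters, the following Python sketch compares the transinformation of two cells taken jointly with the sum of their individual values. It is not part of the original experiment: the helper `mutual_information`, the variable names, and the four-sample data set are all assumptions made for this example. When one cell merely duplicates another, the pair carries no more information about the class than either cell alone, so adding per-cell transinformation values would overstate their combined worth.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from paired samples."""
    n = len(xs)
    n_xy = Counter(zip(xs, ys))
    n_x = Counter(xs)
    n_y = Counter(ys)
    return sum((c / n) * math.log2(c * n / (n_x[x] * n_y[y]))
               for (x, y), c in n_xy.items())

# Hypothetical samples: two classes, three binary cells.
labels = ["A", "A", "B", "B"]
cell_a = [1, 1, 0, 0]   # separates the classes perfectly
cell_b = [1, 1, 0, 0]   # exact duplicate of cell_a
cell_c = [1, 0, 1, 0]   # unrelated to the class

single_a = mutual_information(cell_a, labels)
single_b = mutual_information(cell_b, labels)
single_c = mutual_information(cell_c, labels)

# Treat the pair (cell_a, cell_b) as one joint characteristic.
joint_ab = mutual_information(list(zip(cell_a, cell_b)), labels)

# Information counted twice if the two cells were scored independently.
redundancy = single_a + single_b - joint_ab
print(single_a, single_b, single_c, joint_ab, redundancy)
```

A more comprehensive experiment of the kind proposed above could tabulate such joint values for pairs of cells, at the cost of needing many more samples per pattern class.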